sa2.png

Natural Language Processing: US Airline Twitter Sentiment Analysis¶

General Overview¶

Background & Context¶

Twitter's massive user base of 330 million monthly active users presents a direct avenue for businesses to connect with a broad audience. However, the vast amount of information on the platform makes it challenging for brands to swiftly detect negative social mentions that may impact their reputation. To tackle this, sentiment analysis has become a crucial tool in social media marketing, enabling businesses to monitor emotions in conversations, understand customer sentiments, and gain insights to stay ahead in their industry.

That's why sentiment analysis/classification, which involves monitoring emotions in conversations on social media platforms, has become a key strategy in social media marketing.

Objective¶

The aim of this project is to build a sentimental analysis model that classifies the sentiment of tweets of US Airlines customers into positive, neutral & negative.

Data Dictionary¶

  • tweet_id - A unique identifier for each tweet
  • airline_sentiment - The sentiment label of the tweet, such as positive, negative, or neutral
  • airline_sentiment_confidence - The confidence level associated with the sentiment label
  • negativereason - A category indicating the reason for negative sentiment
  • negativereason_confidence - The confidence level associated with the negative reason
  • airline - The airline associated with the tweet
  • airline_sentiment_gold - Gold standard sentiment label
  • name - The username of the tweet author
  • retweet_count - The number of times the tweet has been retweeted
  • text - The actual text content of the tweet.
  • tweet_coord - Coordinates of the tweet
  • tweet_created - The timestamp when the tweet was created
  • tweet_location - The location mentioned in the tweet
  • user_timezone - The timezone of the tweet author

Approach¶

I will employ a systematic approach to develop and choose the best model for this Twitter Sentiment Analysis task.

We will use Bag of Words (BoW) with Random Forrest Classifier, TF-IDF with Random Forrest Classifier & Keras Tokenizer with Long-Short Term Memory (LSTM) & GLoVE with LSTM and then we choose the best model.

Our approach will involve the following key steps:

Sanity Checks:¶

  • We will load the necessary libraries
  • Load the dataset
  • Check for and fix missing and duplicate values

Exploratory Data Analysis:¶

  • EDA will be performed on the dataset to unvail insights.
  • Explore and visualize the dataset to gain insights into the distribution of sentiment classes and potential data imbalances.

Data Preprocessing:¶

  • Preprocess the text data, including tasks like removing HTML tags, handling special characters, tokenization, lowercasing, and removing stop words.
  • Additionally, I will use lemmatization technique to reduce each word to the barest form.

Vectorization & Embedding:¶

  • Convert the preprocessed text data into numerical features suitable for machine learning models. I will consider three main feature representations:

    ◎ Bag of Words (BoW): Transform the text data into a matrix of word frequencies.

    ◎ TF-IDF (Term Frequency-Inverse Document Frequency): Represent the text data using TF-IDF scores to give importance to rare words.

    ◎ Word Embeddings - GloVe: Capture semantic relationships between words in dense vector representations.

Model Building and Evaluation:¶

  • We will be using 2 models (Random Forest & LSTM Neural Network), 3 vectorizers (BoW, TF-IDF & Keras Tokenizer) and 1 word embedding techinique (GloVe) to build 4 models for our sentiment analysis:
    • Random Forest (RF) with BoW features.
    • Random Forest (RF) with TF-IDF features.
    • Long Short-Term Memory (LSTM) neural network with Keras Tokenizer.
    • LSTM neural network with GloVe word embeddings.
  • We will be using Accuracy as our evaluation metric as this gives us a balanced view of model performance.

Model Optimization:¶

  • We will fine-tune each model to improve performance.

Model Selection:¶

  • We compare the performance of all models based on evaluation metric of accuracy.
  • Then we select the best-performing model on unseen data as the final sentiment analysis model.

Sanity Checks¶

Importing Necessary Libraries¶

In [1]:
# install and import necessary libraries.

import numpy as np  # Import numpy.
import pandas as pd  # Import pandas.

import matplotlib.pyplot as plt  # Import Matplotlib

# To help with data visualization
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns  # Import seaborn

sns.set(
    color_codes=True
)  # -----This adds a background color to all the plots created using seaborn

# Allow the use of Display via interactive Python
from IPython.display import display

# Import library for exploratory visualization of missing data.
import missingno as ms

from sklearn.feature_extraction.text import CountVectorizer  # Import count Vectorizer
from sklearn.feature_extraction.text import TfidfVectorizer  # Import Tf-Idf vector
from sklearn.model_selection import train_test_split  # Import train test split
from sklearn.ensemble import RandomForestClassifier  # Import Rndom Forest Classifier
from sklearn.model_selection import cross_val_score  # Import cross val score
from sklearn.metrics import confusion_matrix  # Import confusion matrix

from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

from sklearn.preprocessing import LabelEncoder  # To convert target variable to numeric

from tensorflow.keras.metrics import Precision, Recall
import re, string, unicodedata  # Import Regex, string and unicodedata.
import contractions  # Import contractions library.
from bs4 import BeautifulSoup  # Import BeautifulSoup.

import nltk  # Import Natural Language Tool-Kit.
from nltk.corpus import stopwords  # Import stopwords.
from nltk.tokenize import word_tokenize, sent_tokenize  # Import Tokenizer.
from nltk.stem.wordnet import WordNetLemmatizer  # Import Lemmatizer.

# nltk.download("omw-1.4") # Package omw-1.4 is already up-to-date!
# nltk.download("stopwords")  # Package stopwords is already up-to-date!
# nltk.download("punkt") # Package punkt is already up-to-date!
# nltk.download("wordnet") # Package wordnet is already up-to-date!

from wordcloud import WordCloud, STOPWORDS  # Import WorldCloud and Stopwords

from tensorflow.keras.preprocessing.text import Tokenizer  # Import Keras Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences  # Import padding

# Sequential: This allows us to create a linear stack of layers for building neural networks
from tensorflow.keras.models import Sequential

# Import the different layers we will use in our sequential neural network
from tensorflow.keras.layers import (
    Embedding,
    Bidirectional,
    LSTM,
    Dense,
    Dropout,
    SpatialDropout1D,
)

from tensorflow.keras.optimizers import Adam  # Importing the Adam optimizer algorithm

# Import the different callbacks to be used
from tensorflow.keras.callbacks import (
    LearningRateScheduler,
    History,
    EarlyStopping,
    ModelCheckpoint,
)


# To supress warnings
import warnings

warnings.filterwarnings("ignore")


# Making the Python code more structured automatically
%reload_ext nb_black

# Define maximum number of columns to be displayed in a dataframe
pd.set_option("display.max_columns", None)
2023-09-15 00:13:40.867542: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  SSE4.1 SSE4.2
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.

Loading the dataset¶

In [2]:
data = pd.read_csv("Tweets.csv")  # Code to read the dataset

# Making a copy of the data to avoid any changes to original data
df = data.copy()

print("Loading Dataset... Done.")
Loading Dataset... Done.

Data Overview¶

Check the head and tail of the data¶

In [3]:
# Checking the top 5, botom 5 and 10 random rows

display(df.head())  # -----looking at head (top 5 observations)
display(df.tail())  # -----looking at tail (bottom 5 observations)
display(
    df.sample(5, random_state=1)
)  # -----5 random sample of observations from the data
tweet_id airline_sentiment airline_sentiment_confidence negativereason negativereason_confidence airline airline_sentiment_gold name negativereason_gold retweet_count text tweet_coord tweet_created tweet_location user_timezone
0 570306133677760513 neutral 1.0000 NaN NaN Virgin America NaN cairdin NaN 0 @VirginAmerica What @dhepburn said. NaN 2015-02-24 11:35:52 -0800 NaN Eastern Time (US & Canada)
1 570301130888122368 positive 0.3486 NaN 0.0000 Virgin America NaN jnardino NaN 0 @VirginAmerica plus you've added commercials t... NaN 2015-02-24 11:15:59 -0800 NaN Pacific Time (US & Canada)
2 570301083672813571 neutral 0.6837 NaN NaN Virgin America NaN yvonnalynn NaN 0 @VirginAmerica I didn't today... Must mean I n... NaN 2015-02-24 11:15:48 -0800 Lets Play Central Time (US & Canada)
3 570301031407624196 negative 1.0000 Bad Flight 0.7033 Virgin America NaN jnardino NaN 0 @VirginAmerica it's really aggressive to blast... NaN 2015-02-24 11:15:36 -0800 NaN Pacific Time (US & Canada)
4 570300817074462722 negative 1.0000 Can't Tell 1.0000 Virgin America NaN jnardino NaN 0 @VirginAmerica and it's a really big bad thing... NaN 2015-02-24 11:14:45 -0800 NaN Pacific Time (US & Canada)
tweet_id airline_sentiment airline_sentiment_confidence negativereason negativereason_confidence airline airline_sentiment_gold name negativereason_gold retweet_count text tweet_coord tweet_created tweet_location user_timezone
14635 569587686496825344 positive 0.3487 NaN 0.0000 American NaN KristenReenders NaN 0 @AmericanAir thank you we got on a different f... NaN 2015-02-22 12:01:01 -0800 NaN NaN
14636 569587371693355008 negative 1.0000 Customer Service Issue 1.0000 American NaN itsropes NaN 0 @AmericanAir leaving over 20 minutes Late Flig... NaN 2015-02-22 11:59:46 -0800 Texas NaN
14637 569587242672398336 neutral 1.0000 NaN NaN American NaN sanyabun NaN 0 @AmericanAir Please bring American Airlines to... NaN 2015-02-22 11:59:15 -0800 Nigeria,lagos NaN
14638 569587188687634433 negative 1.0000 Customer Service Issue 0.6659 American NaN SraJackson NaN 0 @AmericanAir you have my money, you change my ... NaN 2015-02-22 11:59:02 -0800 New Jersey Eastern Time (US & Canada)
14639 569587140490866689 neutral 0.6771 NaN 0.0000 American NaN daviddtwu NaN 0 @AmericanAir we have 8 ppl so we need 2 know h... NaN 2015-02-22 11:58:51 -0800 dallas, TX NaN
tweet_id airline_sentiment airline_sentiment_confidence negativereason negativereason_confidence airline airline_sentiment_gold name negativereason_gold retweet_count text tweet_coord tweet_created tweet_location user_timezone
8515 568198336651649027 positive 1.0000 NaN NaN Delta NaN GenuineJack NaN 0 @JetBlue I'll pass along the advice. You guys ... NaN 2015-02-18 16:00:14 -0800 Massachusetts Central Time (US & Canada)
3439 568438094652956673 negative 0.7036 Lost Luggage 0.7036 United NaN vina_love NaN 0 @united I sent you a dm with my file reference... NaN 2015-02-19 07:52:57 -0800 ny Quito
6439 567858373527470080 positive 1.0000 NaN NaN Southwest NaN Capt_Smirk NaN 0 @SouthwestAir Black History Commercial is real... NaN 2015-02-17 17:29:21 -0800 La Florida Eastern Time (US & Canada)
5112 569336871853170688 negative 1.0000 Late Flight 1.0000 Southwest NaN scoobydoo9749 NaN 0 @SouthwestAir why am I still in Baltimore?! @d... [39.1848041, -76.6787131] 2015-02-21 19:24:22 -0800 Tallahassee, FL America/Chicago
5645 568839199773732864 positive 0.6832 NaN NaN Southwest NaN laurafall NaN 0 @SouthwestAir SEA to DEN. South Sound Volleyba... NaN 2015-02-20 10:26:48 -0800 NaN Pacific Time (US & Canada)

Observations

  • From the first & last few rows and the sample rows, the dataset has been loaded properly

  • airline_sentiment is our target variable and it will be converted to numerical digits.

  • We will drop rows like tweet_id, name... as they will add no value to our models.

Understand the shape of the dataset¶

In [4]:
df.shape  # Code to get the shape of data
Out[4]:
(14640, 15)

Observations

  • There are 14,640 rows and 15 columns in the dataset

Checking the Data Types and General Information of the Dataset¶

In [5]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14640 entries, 0 to 14639
Data columns (total 15 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   tweet_id                      14640 non-null  int64  
 1   airline_sentiment             14640 non-null  object 
 2   airline_sentiment_confidence  14640 non-null  float64
 3   negativereason                9178 non-null   object 
 4   negativereason_confidence     10522 non-null  float64
 5   airline                       14640 non-null  object 
 6   airline_sentiment_gold        40 non-null     object 
 7   name                          14640 non-null  object 
 8   negativereason_gold           32 non-null     object 
 9   retweet_count                 14640 non-null  int64  
 10  text                          14640 non-null  object 
 11  tweet_coord                   1019 non-null   object 
 12  tweet_created                 14640 non-null  object 
 13  tweet_location                9907 non-null   object 
 14  user_timezone                 9820 non-null   object 
dtypes: float64(2), int64(2), object(11)
memory usage: 1.7+ MB

Observations

  • There are missing values in a number of columns.
  • 2 Columns are of type Float, 2 are of type Integer and 11 are of type Object.

Checking Unique Values of Some Categorical Variables¶

In [6]:
# Few Select Categorical Columns
cat_cols = ["airline", "airline_sentiment", "airline_sentiment_gold", "negativereason"]

# Check the unique values of select categorical variables
for i in cat_cols:
    print("Unique values % in", i, "are :")
    print(df[i].value_counts(normalize=True) * 100)
    print("*" * 50)
    print("\n")
Unique values % in airline are :
United            26.106557
US Airways        19.897541
American          18.845628
Southwest         16.530055
Delta             15.177596
Virgin America     3.442623
Name: airline, dtype: float64
**************************************************


Unique values % in airline_sentiment are :
negative    62.691257
neutral     21.168033
positive    16.140710
Name: airline_sentiment, dtype: float64
**************************************************


Unique values % in airline_sentiment_gold are :
negative    80.0
positive    12.5
neutral      7.5
Name: airline_sentiment_gold, dtype: float64
**************************************************


Unique values % in negativereason are :
Customer Service Issue         31.706254
Late Flight                    18.141207
Can't Tell                     12.965788
Cancelled Flight                9.228590
Lost Luggage                    7.888429
Bad Flight                      6.319460
Flight Booking Problems         5.763783
Flight Attendant Complaints     5.240793
longlines                       1.939420
Damaged Luggage                 0.806276
Name: negativereason, dtype: float64
**************************************************


Observations

  • United Airlines has most of the Sentiments (26%). Virgin America has the least (3%)
  • Airline sentiments, both airline_sentiment & airline_sentiment_gold, are mostly negative with airline_sentiment having 63% negative sentiments and airline_sentiment_gold 80%
  • Customer Service Issue dominates the negativereason at about 32% followed by Late Flight with about 18%
  • It's interesting that Cancelled Flight, Lost Luggage, Bad Flight, Longlines, Damaged Luggage are all under 10%

Checking for Missing & Duplicate Values¶

Missing Values¶

In [7]:
# Checking missing values across each columns

c_missing = pd.Series(df.isnull().sum(), name="Missing Count")  # -----Count Missing

p_missing = pd.Series(
    round(df.isnull().sum() / df.shape[0] * 100, 2), name="% Missing"
)  # -----Percentage Missing


# Combine the Count and Percentage into 1 Dataframe
missing_df = pd.concat([c_missing, p_missing], axis=1)

missing_df.sort_values(by="% Missing", ascending=False).style.background_gradient(
    cmap="YlOrRd"
)
Out[7]:
  Missing Count % Missing
negativereason_gold 14608 99.780000
airline_sentiment_gold 14600 99.730000
tweet_coord 13621 93.040000
negativereason 5462 37.310000
user_timezone 4820 32.920000
tweet_location 4733 32.330000
negativereason_confidence 4118 28.130000
tweet_id 0 0.000000
airline_sentiment 0 0.000000
airline_sentiment_confidence 0 0.000000
airline 0 0.000000
name 0 0.000000
retweet_count 0 0.000000
text 0 0.000000
tweet_created 0 0.000000
In [8]:
# Visual Exploration of Missing Values
# Plot missing values across each columns
plt.title("Missing Values Graph", fontsize=20)
ms.bar(df)
Out[8]:
<Axes: title={'center': 'Missing Values Graph'}>

Observations

  • negativereason_gold and airline_sentiment_gold are missing 99.78% & 99.73% with only 40 and 32 entries respectively. Too many missing entries. These columns will be deleted.

  • negativereason, user_timezone, tweet_location and negativereason_confidence are missing over 28%. These columns will also be deleted because they are not crucial to our sentiment analysis, also imputing the missing values may lead to misleading conclusions.

  • Our two most important features, airline_sentiment and text, have no missing values.

  • It is unfortunate that we have to drop airline_sentiment_gold because that is the groundtruth label.

Duplicate Values¶

In [9]:
# Checking for duplicate records

df.duplicated().sum()
Out[9]:
36
In [10]:
# Remove duplicate rows based on all columns

df = df.drop_duplicates()
In [11]:
# Checking for duplicate records

df.duplicated().sum()
Out[11]:
0

Observations

  • There were 36 duplicate values.
  • All the duplicates have been dropped.

Exploratory Data Analysis¶

Univariate Analysis¶

In [12]:
# -----
# User defined function to plot labeled_barplot
# -----


def labeled_barplot(data, feature, perc=False, v_ticks=True, n=None):
    """
    Barplot with percentage at the top

    data: dataframe
    feature: dataframe column
    perc: whether to display percentages instead of count (default is False)
    n: displays the top n category levels (default is None, i.e., display all levels)
    """

    total = len(data[feature])  # length of the column
    count = data[feature].nunique()
    if n is None:
        plt.figure(figsize=(count + 1, 5))
    else:
        plt.figure(figsize=(n + 1, 5))

    if v_ticks is True:
        plt.xticks(rotation=90)

    ax = sns.countplot(
        data=data,
        x=feature,
        palette="Paired",
        order=data[feature].value_counts().index[:n].sort_values(),
    )

    for p in ax.patches:
        if perc == True:
            label = "{:.1f}%".format(
                100 * p.get_height() / total
            )  # percentage of each class of the category
        else:
            label = p.get_height()  # count of each level of the category

        x = p.get_x() + p.get_width() / 2  # width of the plot
        y = p.get_height()  # height of the plot

        ax.annotate(
            label,
            (x, y),
            ha="center",
            va="center",
            size=12,
            xytext=(0, 5),
            textcoords="offset points",
        )  # annotate the percentage
    plt.show()  # show the plot

Percentage of Tweets for each Airline¶

In [13]:
# Code to plot the Percentage of Tweets for each Airline

labeled_barplot(
    df, "airline", perc=True
)  # Code to plot the labeled barplot for airline

Observations

  • United is the most tweeted about airline with 26.2% of the tweets.
  • Virgin America is the least tweeted about airline with 3.5% of the tweets.
  • The others are between hovering between 15% - 20% of the tweets.

Distribution of Sentiments across all the Tweets¶

In [14]:
# Code to plot the Distribution of Sentiments across all the Tweets

labeled_barplot(
    df, "airline_sentiment", perc=True
)  # Code to plot the labeled barplot for airline_sentiment

Observations

  • Most of the Twitter Sentiments are negative (about 63%)
  • About 21% are neutral and about only 16% are positive.

Plot of all the Negative Reasons¶

In [15]:
# Code to shot the plot of all the Negative Reasons

labeled_barplot(
    df, "negativereason", perc=True
)  # Cdode to plot the labeled barplot for negative reason

Observations

  • Customer Service Issue dominates the negativereason followed by Late Flight
  • It's interesting that longlines and Damaged Luggage are not as significant

Bivariate Analysis¶

Distribution of Sentiment of Tweets for each Airline¶

In [16]:
# Code to plot the barplot for the distribution of each airline with total sentiments

airline_sentiment = (
    df.groupby(["airline", "airline_sentiment"]).airline_sentiment.count().unstack()
)
airline_sentiment.plot(kind="bar")
Out[16]:
<Axes: xlabel='airline'>

Observations

  • As expected from "Distribution of Sentiments across all the Tweets", negative sentiments are much more followed by neutral then positive sentiments for each airline apart from Virgin America.
  • The ratios of Sentiments for the airlines appear to be consistent apart from Virgin America.
  • For Virgin America negative sentiment is slightly higher than neutral sentiment which is also slightly higher than positive sentiment.

Distribution of Retweet of Sentiments for each Airline¶

In [17]:
# Code to show the Distribution of Retweet of Sentiments for each Airline

colors = {"positive": "green", "negative": "red", "neutral": "blue"}

for sentiment in df["airline_sentiment"].unique():
    subset = df[df["airline_sentiment"] == sentiment]
    plt.bar(
        subset["airline"],
        subset["retweet_count"],
        label=sentiment,
        color=colors[sentiment],
    )

plt.title("Airline Sentiments by Retweets by Airline")
plt.xlabel("Airline")
plt.ylabel("Retweet Count")
plt.legend(title="Sentiment", loc="upper right")
plt.xticks(rotation=90)
plt.show()

Observations

  • Southwest has the most retweet of positive sentiments followed by Virgin America
  • US Airways has the most retweet of negative sentiment followed by Delta
  • Neutral sentiments do not appear to get retweets.

WordCloud Analysis¶

In [18]:
#####
# Helper function to create and display Wordcloud
#####


def show_wordcloud(data, title):
    words = " ".join(data["text"])
    cleaned_word = " ".join(
        [
            word
            for word in words.split()
            if "http" not in word and not word.startswith("@") and word != "RT"
        ]
    )

    # Create a WordCloud object
    wordcloud = WordCloud(
        stopwords=STOPWORDS, background_color="black", width=3000, height=2500
    ).generate(cleaned_word)

    plt.figure(figsize=(14, 11), frameon=True)
    plt.imshow(wordcloud)
    plt.axis("off")
    plt.title(title, fontsize=30)
    plt.show()

Wordcloud for Negative Tweets¶

In [19]:
# Code to display Wordcloud for Negative Tweets

df_negative = df[df["airline_sentiment"] == "negative"]
show_wordcloud(data=df_negative, title="Negative Tweets")

Observations

  • We see "flight", "customer service", "bag", "hour", "time", "hold", "Cancelled Flight" etc... showing us that these are the dominant words/phrases in the negative tweets.

Wordcloud for Positive Tweets¶

In [20]:
# Code to display Wordcloud for positive tweets

df_positive = df[df["airline_sentiment"] == "positive"]
show_wordcloud(data=df_negative, title="Positive Tweets")

Observations

  • We see "flight", "Hour", "hold", "bag", "help","time", "customer service" etc... showing us that these are the dominant words/phrases in the positive tweets.
  • Understandably, "Flight" appears to be the most dominant word in both negative and positive tweet

Wordcloud for Neutral Tweets¶

In [21]:
# Code to display Wordcloud for neutral tweets

df_positive = df[df["airline_sentiment"] == "neutral"]
show_wordcloud(data=df_negative, title="Neutral Tweets")

Observations

  • We see "flight", "Hour", "Plane", "bag", "help","now" etc... showing us that these are the dominant words/phrases in the neutral tweets.
  • "Flight" appears to be the most dominant word in all neutral, negative and positive tweets.

Data Preparation for Modeling¶

Dropping all unnecessary columns¶

In [22]:
# Extract text and airline sentiment columns from the data
model_df = df[["airline_sentiment", "text"]]  # Code to get a subset of data
In [23]:
model_df.head()  # Code to display the first 5 rows of the dataset
Out[23]:
airline_sentiment text
0 neutral @VirginAmerica What @dhepburn said.
1 positive @VirginAmerica plus you've added commercials t...
2 neutral @VirginAmerica I didn't today... Must mean I n...
3 negative @VirginAmerica it's really aggressive to blast...
4 negative @VirginAmerica and it's a really big bad thing...
In [24]:
model_df.shape  # Code to get the shape of the data
Out[24]:
(14604, 2)
In [25]:
model_df[
    "airline_sentiment"
].value_counts()  # Code to display the unique values in airline sentiment column
Out[25]:
negative    9159
neutral     3091
positive    2354
Name: airline_sentiment, dtype: int64
In [26]:
model_df[
    "airline_sentiment"
].unique()  # Code to display the values in airline sentiment column
Out[26]:
array(['neutral', 'positive', 'negative'], dtype=object)

Observations

  • The new dataframe contains 2 columns and 14,604 rows.
  • The airline_sentiment column has 3 unique values: 'neutral', 'positive', 'negative'.
  • 2,354(16%) of the tweets are 'positive', 3,091(21%) are 'neutral' and 9,159 (63%) are 'negative'

Remove HTML Tages¶

In [27]:
# Code to remove the html tage
def strip_html(text):
    soup = BeautifulSoup(text, "html.parser")
    return soup.get_text()


model_df["text"] = model_df["text"].apply(
    strip_html
)  # Code to apply strip html function on text column
model_df.head()  # Code to display the head of the data
Out[27]:
airline_sentiment text
0 neutral @VirginAmerica What @dhepburn said.
1 positive @VirginAmerica plus you've added commercials t...
2 neutral @VirginAmerica I didn't today... Must mean I n...
3 negative @VirginAmerica it's really aggressive to blast...
4 negative @VirginAmerica and it's a really big bad thing...

Observations

  • HTML Tags have been removed.

Replacing Contractions in String¶

In [28]:
def replace_contractions(text):
    """Replace contractions in string of text"""
    return contractions.fix(text)


model_df["text"] = model_df["text"].apply(
    replace_contractions
)  # Code to apply replace contractions function on text column

model_df.head()  # Code to display the head of the data
Out[28]:
airline_sentiment text
0 neutral @VirginAmerica What @dhepburn said.
1 positive @VirginAmerica plus you have added commercials...
2 neutral @VirginAmerica I did not today... Must mean I ...
3 negative @VirginAmerica it is really aggressive to blas...
4 negative @VirginAmerica and it is a really big bad thin...

Observations

  • Contractions fixed. For instance "didn't" has been changed to "did not" in row 2

Removing Numbers¶

In [29]:
def remove_numbers(text):
    text = re.sub(r"\d+", "", text)  # Code to remove numbers
    return text


model_df["text"] = model_df["text"].apply(
    remove_numbers
)  # Code to apply remove numbers function on text column

model_df.head()  # Code to display the head of the data
Out[29]:
airline_sentiment text
0 neutral @VirginAmerica What @dhepburn said.
1 positive @VirginAmerica plus you have added commercials...
2 neutral @VirginAmerica I did not today... Must mean I ...
3 negative @VirginAmerica it is really aggressive to blas...
4 negative @VirginAmerica and it is a really big bad thin...

Observations

  • All the numbers have been removed.

Applying Tokenization¶

In [30]:
data.apply(lambda row: nltk.word_tokenize(row["text"]), axis=1)
Out[30]:
0           [@, VirginAmerica, What, @, dhepburn, said, .]
1        [@, VirginAmerica, plus, you, 've, added, comm...
2        [@, VirginAmerica, I, did, n't, today, ..., Mu...
3        [@, VirginAmerica, it, 's, really, aggressive,...
4        [@, VirginAmerica, and, it, 's, a, really, big...
                               ...                        
14635    [@, AmericanAir, thank, you, we, got, on, a, d...
14636    [@, AmericanAir, leaving, over, 20, minutes, L...
14637    [@, AmericanAir, Please, bring, American, Airl...
14638    [@, AmericanAir, you, have, my, money, ,, you,...
14639    [@, AmericanAir, we, have, 8, ppl, so, we, nee...
Length: 14640, dtype: object
In [31]:
# Code to apply tokenization on text column
model_df["text"] = model_df.apply(lambda row: nltk.word_tokenize(row["text"]), axis=1)

# Code to display the head of the data
model_df.head()
Out[31]:
airline_sentiment text
0 neutral [@, VirginAmerica, What, @, dhepburn, said, .]
1 positive [@, VirginAmerica, plus, you, have, added, com...
2 neutral [@, VirginAmerica, I, did, not, today, ..., Mu...
3 negative [@, VirginAmerica, it, is, really, aggressive,...
4 negative [@, VirginAmerica, and, it, is, a, really, big...

Observations

  • All the documents have been tokenized.

Applying Lowercase, Remove stopwords & Punctuation¶

Modifying Stopwords¶

In [32]:
# We want stop-word's like "not", "couldn't" etc. because these words matter in Sentiment analysis,
# so we will be removing them from original stopwords data.

stopwords = stopwords.words("english")

customlist = [
    "not",
    "couldn't",
    "didn",
    "didn't",
    "doesn",
    "doesn't",
    "hadn",
    "hadn't",
    "hasn",
    "hasn't",
    "haven",
    "haven't",
    "isn",
    "isn't",
    "ma",
    "mightn",
    "mightn't",
    "mustn",
    "mustn't",
    "needn",
    "needn't",
    "shan",
    "shan't",
    "shouldn",
    "shouldn't",
    "wasn",
    "wasn't",
    "weren",
    "weren't",
    "won",
    "won't",
    "wouldn",
    "wouldn't",
]

# Removing our custom list from stopwords
stopwords = list(set(stopwords) - set(customlist))

PreProcessing¶

In [33]:
lemmatizer = WordNetLemmatizer()  # Instantiating the WordNetLemmatizer


#####
# Defining preprocessing helper functions
#####
def remove_non_ascii(words):
    """Remove non-ASCII characters from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = (
            unicodedata.normalize("NFKD", word)
            .encode("ascii", "ignore")
            .decode("utf-8", "ignore")
        )
        new_words.append(new_word)
    return new_words


def to_lowercase(words):
    """Convert all characters to lowercase from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = word.lower()
        new_words.append(new_word)
    return new_words


def remove_punctuation(words):
    """Remove punctuation from list of tokenized words"""
    new_words = []
    for word in words:
        new_word = re.sub(r"[^\w\s]", "", word)
        if new_word != "":
            new_words.append(new_word)
    return new_words


def remove_stopwords(words):
    """Remove stop words from list of tokenized words"""
    new_words = []
    for word in words:
        if word not in stopwords:
            new_words.append(word)
    return new_words


def lemmatize_list(words):
    new_words = []
    for word in words:
        new_words.append(lemmatizer.lemmatize(word, pos="v"))
    return new_words


def normalize(words):
    words = remove_non_ascii(words)
    words = to_lowercase(words)
    words = remove_punctuation(words)
    words = remove_stopwords(words)
    words = lemmatize_list(words)
    return " ".join(words)


# Applying all the preprocessing functions on our corpus
model_df["text"] = model_df.apply(lambda row: normalize(row["text"]), axis=1)
model_df.head()
Out[33]:
airline_sentiment text
0 neutral virginamerica dhepburn say
1 positive virginamerica plus add commercials experience ...
2 neutral virginamerica not today must mean need take an...
3 negative virginamerica really aggressive blast obnoxiou...
4 negative virginamerica really big bad thing

Observations

  • Data has been preprocessed and our corpus (text) has been normalized.

Model Building¶

In [34]:
#####
# Code to create a dataframe to store accuracy scores of each model
#####

# Create an empty DataFrame with the specified columns
columns = [
    "Random Forest with BoW",
    "Random Forest with TF-IDF",
    "LSTM with Keras Tokenizer",
    "LSTM with GloVe embedding",
]

accuracy_df = pd.DataFrame(columns=columns)

# Define a list where the train & test accuracy scores for each model will be stored before tranfering to the dataframe.
accuracy_scores_train = []
accuracy_scores_test = []

Bag of Words (BoW) Vectors - Using countvectorizer¶

In [35]:
# Vectorization (Convert our corpus (text data) to numbers).

# Code to initialize the CountVectorizer function with max_ features = 5000.
bow_vec = CountVectorizer(max_features=5000)

# Code to fit and transrofm the count_vec variable on the text column
bow_features = bow_vec.fit_transform(model_df["text"])

# Code to convert the datafram into array
bow_features = bow_features.toarray()
In [36]:
bow_features.shape  # Code to check the shape of the data features
Out[36]:
(14604, 5000)

Storing Independent and Dependent variables¶

In [37]:
X = bow_features  # Code to get the independent variable stored as X

y = model_df["airline_sentiment"]  # Code to get the dependent variable stored as Y

Splitting the data into train and test¶

In [38]:
# Split data into training and testing set.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

Random Forest Model on BoW¶

In [39]:
# Using Random Forest to build model for the classification of reviews.

rf_bow = RandomForestClassifier(
    n_estimators=10, n_jobs=4
)  # Initialize the Random Forest Classifier

rf_bow = rf_bow.fit(X_train, y_train)  # Fit the rf model on X_train and y_train

print(rf_bow)

print(
    np.mean(cross_val_score(rf_bow, X_train, y_train, cv=10))
)  # Calculate cross validation score
RandomForestClassifier(n_estimators=10, n_jobs=4)
0.7574832664757543

Observations

  • The Cross Validation Score on the Train Set is 75%.
  • Now let's try to improve on the model.

Optimize the parameter: The number of trees in the random forest model(n_estimators)¶

In [40]:
# Finding optimal number of base learners using k-fold CV ->
base_ln = [x for x in range(1, 25)]
In [41]:
# K-Fold Cross - validation .
cv_scores = []  # Initializing a emptry list to store the score
for b in base_ln:
    clf_bow = RandomForestClassifier(
        n_estimators=b
    )  # Code to apply Rondome Forest Classifier
    scores = cross_val_score(
        clf_bow, X_train, y_train, cv=10, scoring="accuracy"
    )  # Code to find the cross-validation score on the classifier (clf) for accuracy
    cv_scores.append(scores.mean())  # Append the scores to cv_scores list
In [42]:
# plot the error as k increases
error = [1 - x for x in cv_scores]  # Error corresponds to each number of estimator
optimal_learners = base_ln[
    error.index(min(error))
]  # Selection of optimal number of n_estimator corresponds to minimum error.
plt.plot(
    base_ln, error
)  # Plot between each number of estimator and misclassification error
xy = (optimal_learners, min(error))
plt.annotate("(%s, %s)" % xy, xy=xy, textcoords="data")
plt.xlabel("Number of base learners")
plt.ylabel("Misclassification Error")
plt.show()
In [43]:
# Train the best model and calculating accuracy on test data .
clf_bow = RandomForestClassifier(
    n_estimators=optimal_learners
)  # Initialize the Random Forest classifier with optimal learners
clf_bow.fit(X_train, y_train)  # Fit the classifer on X_train and y_train

bow_trainscore = clf_bow.score(
    X_train, y_train
)  # Find the score on X_train and y_train
bow_testscore = clf_bow.score(X_test, y_test)  # Find the score on X_test and y_test
In [44]:
accuracy_scores_train.append(bow_trainscore)
accuracy_scores_test.append(bow_testscore)

print("Train Score: ", accuracy_scores_train)
print("Test Score: ", accuracy_scores_test)
Train Score:  [0.9917824300528273]
Test Score:  [0.7535371976266545]

Observations

  • The model did not improve as both cross validation score and test score is 75%
  • The Optimal number of learners (Decision Trees: n-estimators) is 24 with a loss of 0.236.
  • The model is overfitting on the Train set (99%).

Confusion Matrix and Classification Report¶

In [45]:
# Predict the result for test data using the model built above.
result = clf_bow.predict(
    X_test
)  # Code to predict the X_test data using the model built above (forest)
In [46]:
# Print and plot Confusion matirx

conf_mat = confusion_matrix(
    y_test, result
)  # Code to calculate the confusion matrix between test data and result

print(conf_mat)  # Print confusion matrix
[[2479  191   71]
 [ 426  439   84]
 [ 180  128  384]]
In [47]:
# Plot the confusion matrix
df_cm = pd.DataFrame(
    conf_mat,
    index=[i for i in ["positive", "negative", "neutral"]],
    columns=[i for i in ["positive", "negative", "neutral"]],
)
plt.figure(figsize=(10, 7))
sns.heatmap(df_cm, annot=True, fmt="g")
Out[47]:
<Axes: >
In [48]:
# Generate the classification report
report = classification_report(y_test, result)

# Print the classification report
print(report)
              precision    recall  f1-score   support

    negative       0.80      0.90      0.85      2741
     neutral       0.58      0.46      0.51       949
    positive       0.71      0.55      0.62       692

    accuracy                           0.75      4382
   macro avg       0.70      0.64      0.66      4382
weighted avg       0.74      0.75      0.74      4382

Observations

  • The accuracy score of the Random Forest Classifier on Bag of Words (bow) vectors is 75% on unseen data.
  • The 'negative' tweets have the highest recall & f1 scores of 90% & 85% respectively.
  • The 'positive' tweets have the highest precision scores of 71%

Wordcloud of top 40 important features from BoW+Randomforest Model¶

In [49]:
all_features = (
    bow_vec.get_feature_names_out()
)  # Instantiate the feature from the vectorizer
top_features = (
    ""  # Addition of top 40 feature into top_feature after training the model
)
feat = clf_bow.feature_importances_
features = np.argsort(feat)[::-1]
for i in features[0:40]:
    top_features += all_features[i]
    top_features += ","

print(top_features)

print(" ")
print(" ")

# Complete the code by applying wordcloud on top features
wordcloud = WordCloud(background_color="black", width=3000, height=2500).generate(
    top_features
)
thank,not,great,jetblue,usairways,delay,http,flight,southwestair,unite,hours,hold,awesome,love,get,cancel,bag,americanair,wait,virginamerica,best,call,amaze,hour,dm,time,lose,service,please,customer,help,appreciate,go,hrs,make,need,follow,plane,never,still,
 
 
In [50]:
plt.figure(figsize=(14, 11), frameon=True)
plt.imshow(wordcloud)
plt.axis("off")
plt.title("Top 40 features WordCloud", fontsize=30)
plt.show()

Observations

  • We see "thank","hour", "great", "delay", "jetblue" etc... showing us that these are the dominant words.
  • It's interesting to see "jetblue" as a more dominant word than some of the airlines represented in the airlines column...considering that jetblue is not a part of our analysis. Further analysis needed.

Using TF-IDF (Term Frequency- Inverse Document Frequency)¶

In [51]:
# Using TfidfVectorizer to convert text data to numbers.

tfidf_vect = TfidfVectorizer(max_features=5000)          # Code to initialize the TF-IDF vector function with max_features = 5000.
tfidf_features = tfidf_vect.fit_transform(data['text'])   # Fit_transform the tf idf function on the text column

tfidf_features = tfidf_features.toarray()                  # Code to convert the dataframe into array
In [52]:
tfidf_features.shape  # Code to check the shape of the data features
Out[52]:
(14640, 5000)

Storing Independent and Dependent variables¶

In [53]:
X = tfidf_features  # Code to get the independent variable (data_features) stored as X

y = data[
    "airline_sentiment"
]  # Code to get the dependent variable (airline_sentiment) stored as y

Splitting the data into train and test¶

In [54]:
# Split data into training and testing set.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)

Random Forest Model on TF-IDF¶

In [55]:
# Using Random Forest to build model for the classification of reviews.

rf_tfidf = RandomForestClassifier(
    n_estimators=10, n_jobs=4
)  # Initialize the Random Forest Classifier

rf_tfidf = rf_tfidf.fit(
    X_train, y_train
)  # Fit the forest variable on X_train and y_train

print(rf_tfidf)

print(
    np.mean(cross_val_score(rf_tfidf, X_train, y_train, cv=10))
)  # Calculate cross validation score
RandomForestClassifier(n_estimators=10, n_jobs=4)
0.7307772484756098

Observations

  • The Random Forest Classifier Cross Validation Score on the Train Set of TF-IDF is 73%...down by 2% from 75% of Bag of Words.
  • Now let's try to improve on the model.

Optimize the parameter: The number of trees in the random forest model(n_estimators)¶

In [56]:
# Finding optimal number of base learners using k-fold CV ->
base_ln = [x for x in range(1, 25)]
In [57]:
# K-Fold Cross - validation .
cv_scores = []  # Initializing a emptry list to store the score
for b in base_ln:
    clf_tfidf = RandomForestClassifier(
        n_estimators=b
    )  # Code to apply Rondome Forest Classifier
    scores = cross_val_score(
        clf_tfidf, X_train, y_train, cv=10, scoring="accuracy"
    )  # Code to find the cross-validation score on the classifier (clf) for accuracy
    cv_scores.append(scores.mean())  # Append the scores to cv_scores list
In [58]:
# Plot the misclassification error for each of estimators

error = [1 - x for x in cv_scores]  # Error corresponds to each number of estimator
optimal_learners = base_ln[
    error.index(min(error))
]  # Selection of optimal number of n_estimator corresponds to minimum error.
plt.plot(
    base_ln, error
)  # Plot between each number of estimator and misclassification error
xy = (optimal_learners, min(error))
plt.annotate("(%s, %s)" % xy, xy=xy, textcoords="data")
plt.xlabel("Number of base learners")
plt.ylabel("Misclassification Error")
plt.show()
In [59]:
# Train the best model and calculating accuracy on test data .
clf_tfidf = RandomForestClassifier(
    n_estimators=optimal_learners
)  # Initialize the Random Forest classifier with optimal learners
clf_tfidf.fit(X_train, y_train)  # Fit the classifer on X_train and y_train

tfidf_trainscore = clf_tfidf.score(
    X_train, y_train
)  # Find the score on X_train and y_train
tfidf_testscore = clf_tfidf.score(X_test, y_test)  # Find the score on X_test and y_test
In [60]:
accuracy_scores_train.append(tfidf_trainscore)
accuracy_scores_test.append(tfidf_testscore)

print("Train Score: ", accuracy_scores_train)
print("Test Score: ", accuracy_scores_test)
Train Score:  [0.9917824300528273, 0.9942427790788446]
Test Score:  [0.7535371976266545, 0.7470400728597449]

Observations

  • The Optimal number of learners (Decision Trees: n-estimators) is 23 with a loss of 0.255.
  • The Accuracy Score of the optimal number of learners on the Test Set is 74% (1% better than Train (73%)). And 1% less than BoW (75%).
  • The model is overfitting on train set (99%).

Confusion Matrix and Classification Report¶

In [61]:
# Predict the result for test data using the model built above.
result = clf_tfidf.predict(
    X_test
)  # Code to predict the X_test data using the model built above (forest)
In [62]:
# Plot the confusion matrix
conf_mat = confusion_matrix(
    y_test, result
)  # Complete the code to calculate the confusion matrix between test data and restust


df_cm = pd.DataFrame(
    conf_mat,
    index=[i for i in ["positive", "negative", "neutral"]],
    columns=[i for i in ["positive", "negative", "neutral"]],
)
plt.figure(figsize=(10, 7))
sns.heatmap(
    df_cm, annot=True, fmt="g"
)  # Complete the code to plot the heatmap of the confusion matrix
Out[62]:
<Axes: >
In [63]:
# Generate the classification report
report = classification_report(y_test, result)

# Print the classification report
print(report)
              precision    recall  f1-score   support

    negative       0.75      0.96      0.84      2741
     neutral       0.66      0.36      0.46       936
    positive       0.82      0.45      0.58       715

    accuracy                           0.75      4392
   macro avg       0.74      0.59      0.63      4392
weighted avg       0.74      0.75      0.72      4392

Observations

  • The accuracy score of the Random Forest Classifier on TF-IDF vectors is 74% on unseen data.
  • The 'negative' tweets have the highest recall & f1 scores of 96% & 84% respectively.
  • The 'positive' tweets have the highest precision scores of 80%

Wordcloud of top 40 important features from TF-IDF+Randomforest Model¶

In [64]:
all_features = (
    tfidf_vect.get_feature_names_out()
)  # Instantiate the feature from the vectorizer
top_features = (
    ""  # Addition of top 40 feature into top_feature after training the model
)
feat = clf_tfidf.feature_importances_
features = np.argsort(feat)[::-1]
for i in features[0:40]:
    top_features += all_features[i]
    top_features += ", "

print(top_features)

print(" ")
print(" ")

# Complete the code by applying wordcloud on top features
wordcloud = WordCloud(background_color="black", width=2000, height=1500).generate(
    top_features
)
thank, thanks, southwestair, jetblue, usairways, americanair, you, to, united, great, http, the, on, flight, not, no, co, for, and, is, virginamerica, hold, awesome, your, love, my, can, it, in, cancelled, dm, of, but, from, that, have, amazing, will, delayed, be, 
 
 
In [65]:
# Display the generated image:
plt.figure(1, figsize=(14, 11), frameon="equal")
plt.imshow(wordcloud, interpolation="bilinear")
plt.axis("off")
plt.title("Top 40 features WordCloud", fontsize=30)
plt.show()

Observations

  • We see "thank", "jetblue", "southwestair", "usairways" etc... showing us that these are the dominant words.
  • We see again that "jetblue" is more dominant word than any of the airlines represented in the airlines column. Further analysis needed.

Using Long Short Term Memory (LSTM) Neural Network¶

In [66]:
vocab_size = 5000     #Vocabulary Size
oov_token = "<OOV>"   #Out of Vocabulary Token. Place holder for OOV words
max_len = 50          #Maximum length for padding

def tokenize_pad_sequences(text):
    '''
    This function tokenize the input text into sequnences of intergers and then
    pad each sequence to the same length
    '''
    # Text tokenization
    tokenizer = Tokenizer(num_words=vocab_size, oov_token=oov_token)
    tokenizer.fit_on_texts(text)
    # Transforms text to a sequence of integers
    X = tokenizer.texts_to_sequences(text)
    # Pad sequences to the same length
    X = pad_sequences(X, padding='post', maxlen=max_len)
    # return sequences
    return X, tokenizer

Tokenization & Padding¶

In [67]:
X, tokenizer = tokenize_pad_sequences(
    model_df["text"]
)  # Vectorize, Pad and Sequence the corpus

print(
    "Before Tokenization & Padding \n", model_df["text"][22]
)  # Print a sample document before tokenization & padding

print(
    "\nAfter Tokenization & Padding \n", X[22]
)  # Print a sample document after tokenization & padding
Before Tokenization & Padding 
 virginamerica love hipster innovation feel good brand

After Tokenization & Padding 
 [  36   70 4820 2339  311   80 1021    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0    0    0    0    0    0    0
    0    0    0    0    0    0    0    0]
In [68]:
# Get the Word Index: the mapping of the words to numbers.
word_index = tokenizer.word_index

# Convert the dictionary to a list of key-value pairs to display first 40 elements
first_40_elements = list(word_index.items())
first_40_elements[:40]  # show first 40
Out[68]:
[('<OOV>', 1),
 ('flight', 2),
 ('unite', 3),
 ('not', 4),
 ('usairways', 5),
 ('americanair', 6),
 ('southwestair', 7),
 ('jetblue', 8),
 ('get', 9),
 ('thank', 10),
 ('http', 11),
 ('cancel', 12),
 ('service', 13),
 ('delay', 14),
 ('time', 15),
 ('help', 16),
 ('go', 17),
 ('fly', 18),
 ('call', 19),
 ('bag', 20),
 ('wait', 21),
 ('customer', 22),
 ('us', 23),
 ('would', 24),
 ('hold', 25),
 ('make', 26),
 ('need', 27),
 ('hours', 28),
 ('plane', 29),
 ('try', 30),
 ('still', 31),
 ('please', 32),
 ('one', 33),
 ('gate', 34),
 ('back', 35),
 ('virginamerica', 36),
 ('seat', 37),
 ('take', 38),
 ('say', 39),
 ('flightled', 40)]

Train & Test Split¶

In [69]:
y = pd.get_dummies(model_df["airline_sentiment"])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

print("Train Set ->", X_train.shape, y_train.shape)
print("Test Set ->", X_test.shape, y_test.shape)
Train Set -> (11683, 50) (11683, 3)
Test Set -> (2921, 50) (2921, 3)

Bidirectional LSTM Neural Network¶

In [70]:
#####
# Code to build a Long-Short Term memory (LSTM) Neural Network
# This model will be used on the vectors generated by the Keras Tokenizer
#####

embedding_size = 32

rnn_model = Sequential()
# Adding Embedding layer with vocab_size, embedding vectors of embedding_size, and input size of the train data
rnn_model.add(Embedding(vocab_size, embedding_size, input_length=max_len))
# Adding SpatialDropout1D with ratio of 0.2
rnn_model.add(SpatialDropout1D(0.2))
# Adding Bidirectional LSTM layer 
rnn_model.add(Bidirectional(LSTM(128)))
# Adding a dropout ratio of 0.4
rnn_model.add(Dropout(0.4))
# Adding a Dense layer
rnn_model.add(Dense(256, activation='relu'))
# Adding a dropout ratio of 0.4
rnn_model.add(Dropout(0.4))
# Adding output layer with 3 units with softmax as activation function
rnn_model.add(Dense(3, activation="softmax"))
In [71]:
rnn_model.summary()
Model: "sequential"
_________________________________________________________________
 Layer (type)                Output Shape              Param #   
=================================================================
 embedding (Embedding)       (None, 50, 32)            160000    
                                                                 
 spatial_dropout1d (SpatialD  (None, 50, 32)           0         
 ropout1D)                                                       
                                                                 
 bidirectional (Bidirectiona  (None, 256)              164864    
 l)                                                              
                                                                 
 dropout (Dropout)           (None, 256)               0         
                                                                 
 dense (Dense)               (None, 256)               65792     
                                                                 
 dropout_1 (Dropout)         (None, 256)               0         
                                                                 
 dense_1 (Dense)             (None, 3)                 771       
                                                                 
=================================================================
Total params: 391,427
Trainable params: 391,427
Non-trainable params: 0
_________________________________________________________________
In [72]:
# Compile model

adam = Adam(lr=0.001)

rnn_model.compile(
    loss="categorical_crossentropy",
    optimizer=adam,
    metrics=["accuracy"],
)
In [73]:
# Setup callbacks
callbacks = [
    EarlyStopping(
        monitor="loss", mode="min", verbose=1, patience=10
    ),  # stop the training process once the model stops improving.
    ModelCheckpoint(
        filepath="model_weights.h5", save_best_only=True, monitor="loss", mode="min"
    ),  # for saving the best model during training
]

# Train model
epochs = 50
history = rnn_model.fit(
    X_train,
    y_train,
    validation_split=0.2,
    epochs=epochs,
    verbose=1,
    callbacks=callbacks,
)
Epoch 1/50
293/293 [==============================] - 21s 59ms/step - loss: 0.7022 - accuracy: 0.7037 - val_loss: 0.5808 - val_accuracy: 0.7578
Epoch 2/50
293/293 [==============================] - 17s 59ms/step - loss: 0.5004 - accuracy: 0.8037 - val_loss: 0.5302 - val_accuracy: 0.7826
Epoch 3/50
293/293 [==============================] - 17s 57ms/step - loss: 0.4041 - accuracy: 0.8446 - val_loss: 0.5275 - val_accuracy: 0.7920
Epoch 4/50
293/293 [==============================] - 17s 56ms/step - loss: 0.3424 - accuracy: 0.8727 - val_loss: 0.5470 - val_accuracy: 0.7895
Epoch 5/50
293/293 [==============================] - 17s 59ms/step - loss: 0.3008 - accuracy: 0.8908 - val_loss: 0.6084 - val_accuracy: 0.7873
Epoch 6/50
293/293 [==============================] - 17s 59ms/step - loss: 0.2622 - accuracy: 0.9016 - val_loss: 0.6732 - val_accuracy: 0.7805
Epoch 7/50
293/293 [==============================] - 18s 61ms/step - loss: 0.2355 - accuracy: 0.9146 - val_loss: 0.7452 - val_accuracy: 0.7728
Epoch 8/50
293/293 [==============================] - 17s 58ms/step - loss: 0.2200 - accuracy: 0.9191 - val_loss: 0.7636 - val_accuracy: 0.7664
Epoch 9/50
293/293 [==============================] - 18s 61ms/step - loss: 0.1925 - accuracy: 0.9315 - val_loss: 0.8345 - val_accuracy: 0.7595
Epoch 10/50
293/293 [==============================] - 18s 60ms/step - loss: 0.1776 - accuracy: 0.9376 - val_loss: 0.8412 - val_accuracy: 0.7694
Epoch 11/50
293/293 [==============================] - 17s 56ms/step - loss: 0.1702 - accuracy: 0.9383 - val_loss: 0.9212 - val_accuracy: 0.7617
Epoch 12/50
293/293 [==============================] - 17s 59ms/step - loss: 0.1507 - accuracy: 0.9474 - val_loss: 0.9909 - val_accuracy: 0.7574
Epoch 13/50
293/293 [==============================] - 17s 58ms/step - loss: 0.1475 - accuracy: 0.9470 - val_loss: 1.0244 - val_accuracy: 0.7612
Epoch 14/50
293/293 [==============================] - 21s 73ms/step - loss: 0.1337 - accuracy: 0.9532 - val_loss: 1.0815 - val_accuracy: 0.7501
Epoch 15/50
293/293 [==============================] - 17s 58ms/step - loss: 0.1229 - accuracy: 0.9563 - val_loss: 1.1871 - val_accuracy: 0.7604
Epoch 16/50
293/293 [==============================] - 16s 56ms/step - loss: 0.1164 - accuracy: 0.9573 - val_loss: 1.3097 - val_accuracy: 0.7544
Epoch 17/50
293/293 [==============================] - 17s 57ms/step - loss: 0.1150 - accuracy: 0.9590 - val_loss: 1.2128 - val_accuracy: 0.7552
Epoch 18/50
293/293 [==============================] - 17s 58ms/step - loss: 0.1030 - accuracy: 0.9619 - val_loss: 1.3846 - val_accuracy: 0.7454
Epoch 19/50
293/293 [==============================] - 19s 63ms/step - loss: 0.1054 - accuracy: 0.9618 - val_loss: 1.2902 - val_accuracy: 0.7480
Epoch 20/50
293/293 [==============================] - 17s 58ms/step - loss: 0.0982 - accuracy: 0.9665 - val_loss: 1.5996 - val_accuracy: 0.7480
Epoch 21/50
293/293 [==============================] - 18s 60ms/step - loss: 0.0963 - accuracy: 0.9645 - val_loss: 1.3531 - val_accuracy: 0.7493
Epoch 22/50
293/293 [==============================] - 18s 60ms/step - loss: 0.0944 - accuracy: 0.9670 - val_loss: 1.4976 - val_accuracy: 0.7398
Epoch 23/50
293/293 [==============================] - 18s 60ms/step - loss: 0.0835 - accuracy: 0.9697 - val_loss: 1.5929 - val_accuracy: 0.7471
Epoch 24/50
293/293 [==============================] - 18s 60ms/step - loss: 0.0907 - accuracy: 0.9686 - val_loss: 1.5103 - val_accuracy: 0.7535
Epoch 25/50
293/293 [==============================] - 17s 58ms/step - loss: 0.0791 - accuracy: 0.9710 - val_loss: 1.6715 - val_accuracy: 0.7480
Epoch 26/50
293/293 [==============================] - 17s 57ms/step - loss: 0.0795 - accuracy: 0.9706 - val_loss: 1.6864 - val_accuracy: 0.7471
Epoch 27/50
293/293 [==============================] - 17s 58ms/step - loss: 0.0782 - accuracy: 0.9695 - val_loss: 1.6679 - val_accuracy: 0.7441
Epoch 28/50
293/293 [==============================] - 17s 58ms/step - loss: 0.0765 - accuracy: 0.9734 - val_loss: 1.6301 - val_accuracy: 0.7368
Epoch 29/50
293/293 [==============================] - 17s 58ms/step - loss: 0.0679 - accuracy: 0.9766 - val_loss: 1.8098 - val_accuracy: 0.7501
Epoch 30/50
293/293 [==============================] - 17s 57ms/step - loss: 0.0676 - accuracy: 0.9769 - val_loss: 1.5980 - val_accuracy: 0.7565
Epoch 31/50
293/293 [==============================] - 17s 59ms/step - loss: 0.0671 - accuracy: 0.9760 - val_loss: 1.6633 - val_accuracy: 0.7458
Epoch 32/50
293/293 [==============================] - 16s 56ms/step - loss: 0.0669 - accuracy: 0.9769 - val_loss: 1.9026 - val_accuracy: 0.7450
Epoch 33/50
293/293 [==============================] - 16s 56ms/step - loss: 0.0689 - accuracy: 0.9750 - val_loss: 1.7012 - val_accuracy: 0.7467
Epoch 34/50
293/293 [==============================] - 16s 56ms/step - loss: 0.0591 - accuracy: 0.9786 - val_loss: 1.9459 - val_accuracy: 0.7343
Epoch 35/50
293/293 [==============================] - 16s 55ms/step - loss: 0.0622 - accuracy: 0.9776 - val_loss: 1.8143 - val_accuracy: 0.7441
Epoch 36/50
293/293 [==============================] - 16s 55ms/step - loss: 0.0619 - accuracy: 0.9777 - val_loss: 2.0195 - val_accuracy: 0.7454
Epoch 37/50
293/293 [==============================] - 16s 55ms/step - loss: 0.0572 - accuracy: 0.9795 - val_loss: 1.9636 - val_accuracy: 0.7505
Epoch 38/50
293/293 [==============================] - 17s 58ms/step - loss: 0.0625 - accuracy: 0.9796 - val_loss: 1.7665 - val_accuracy: 0.7463
Epoch 39/50
293/293 [==============================] - 16s 55ms/step - loss: 0.0584 - accuracy: 0.9783 - val_loss: 1.8420 - val_accuracy: 0.7424
Epoch 40/50
293/293 [==============================] - 18s 61ms/step - loss: 0.0555 - accuracy: 0.9804 - val_loss: 1.9567 - val_accuracy: 0.7415
Epoch 41/50
293/293 [==============================] - 18s 61ms/step - loss: 0.0510 - accuracy: 0.9815 - val_loss: 1.9890 - val_accuracy: 0.7313
Epoch 42/50
293/293 [==============================] - 17s 58ms/step - loss: 0.0543 - accuracy: 0.9807 - val_loss: 2.0829 - val_accuracy: 0.7398
Epoch 43/50
293/293 [==============================] - 16s 56ms/step - loss: 0.0510 - accuracy: 0.9819 - val_loss: 2.0434 - val_accuracy: 0.7356
Epoch 44/50
293/293 [==============================] - 17s 58ms/step - loss: 0.0532 - accuracy: 0.9811 - val_loss: 2.1640 - val_accuracy: 0.7411
Epoch 45/50
293/293 [==============================] - 17s 58ms/step - loss: 0.0479 - accuracy: 0.9838 - val_loss: 2.1204 - val_accuracy: 0.7351
Epoch 46/50
293/293 [==============================] - 17s 59ms/step - loss: 0.0532 - accuracy: 0.9804 - val_loss: 1.9121 - val_accuracy: 0.7351
Epoch 47/50
293/293 [==============================] - 17s 57ms/step - loss: 0.0480 - accuracy: 0.9834 - val_loss: 2.0331 - val_accuracy: 0.7415
Epoch 48/50
293/293 [==============================] - 16s 56ms/step - loss: 0.0513 - accuracy: 0.9815 - val_loss: 2.1419 - val_accuracy: 0.7321
Epoch 49/50
293/293 [==============================] - 17s 57ms/step - loss: 0.0640 - accuracy: 0.9780 - val_loss: 2.0328 - val_accuracy: 0.7338
Epoch 50/50
293/293 [==============================] - 17s 56ms/step - loss: 0.0415 - accuracy: 0.9845 - val_loss: 2.3481 - val_accuracy: 0.7386
In [74]:
# Evaluate the best model on the test set
loss, LSTM_testscore = rnn_model.evaluate(X_test, y_test, verbose=0)

# Evaluate the best model on the trainset set
loss, LSTM_trainscore = rnn_model.evaluate(X_train, y_train, verbose=0)
In [75]:
accuracy_scores_train.append(LSTM_trainscore)
accuracy_scores_test.append(LSTM_testscore)

print("Train Acurracy Score: ", accuracy_scores_train)
print("Test Acurracy Score: ", accuracy_scores_test)
Train Acurracy Score:  [0.9917824300528273, 0.9942427790788446, 0.9405118823051453]
Test Acurracy Score:  [0.7535371976266545, 0.7470400728597449, 0.734337568283081]
In [76]:
def plot_training_hist(history):
    """Function to plot history for accuracy and loss"""

    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
    # first plot
    ax[0].plot(history.history["accuracy"])
    ax[0].plot(history.history["val_accuracy"])
    ax[0].set_title("Model Accuracy")
    ax[0].set_xlabel("epoch")
    ax[0].set_ylabel("accuracy")
    ax[0].legend(["train", "validation"], loc="best")
    # second plot
    ax[1].plot(history.history["loss"])
    ax[1].plot(history.history["val_loss"])
    ax[1].set_title("Model Loss")
    ax[1].set_xlabel("epoch")
    ax[1].set_ylabel("loss")
    ax[1].legend(["train", "validation"], loc="best")


plot_training_hist(history)

Observations

  • The accuracy score of the LSTM Classifier on Keras Tokenizer is 73% on unseen data.
  • The model is overfitting. Train set is 94% accuracy and validation & test score about 73%.
  • The model is not converging. A lot more data is needed to train the model so as to learn patterns in unseen data

Using GLOVE with LSTM¶

In [77]:
# create the dictionary with those embeddings

glove_embeddings_index = {}
glove_file = open(
    "glove/glove.6B.50d.txt"
)  # The 50 in the name is the same as the maximum length chosen for padding.
for line in glove_file:
    values = line.split()
    word = values[0]
    coefs = np.asarray(values[1:], dtype="float32")
    glove_embeddings_index[word] = coefs
glove_file.close()

print("Found %s word vectors." % len(glove_embeddings_index))
Found 400000 word vectors.
In [78]:
# create a word embedding matrix for each word in the word index
# If a word doesn't have an embedding in GloVe it will be presented with a zero matrix.

embedding_matrix = np.zeros((len(word_index) + 1, max_len))
for word, i in word_index.items():
    embedding_vector = glove_embeddings_index.get(word)
    if embedding_vector is not None:
        # words not found in embedding index will be all-zeros.
        embedding_matrix[i] = embedding_vector
In [79]:
# Show any element of the matrix to confirm content
embedding_matrix[22]
Out[79]:
array([ 0.50120002,  0.052743  ,  0.71052998,  0.46959001,  1.05519998,
        0.023635  , -0.68181998,  0.18503   ,  0.83736002, -0.055731  ,
        0.37808999,  0.43691   , -0.10603   , -0.31305999,  0.060604  ,
       -0.1005    , -1.15310001,  0.37011999,  1.07799995, -1.28260005,
        0.83467001, -0.098129  , -0.85596001,  0.70467001, -0.012172  ,
       -0.97125   , -0.18861   , -0.16795   ,  0.74255002,  0.039095  ,
        2.53259993,  0.75392002,  0.84202999, -0.12890001,  0.11043   ,
       -0.39398   , -0.65667999,  0.0034273 ,  0.04577   , -0.43445   ,
        0.75432998, -0.27877   , -0.030205  ,  0.55124998, -0.18464001,
       -0.66623998,  0.13788   ,  0.99896997,  0.24781001,  1.18610001])

Creating the Keras embedding layer with GloVe weights¶

The embedding_matrix obtained above is used as the weights of an embedding layer in our neural network model.

We set the trainable parameter of this layer to False so that is not trained.

In [80]:
embedding_layer = Embedding(
    input_dim=len(word_index)
    + 1,  # 1 is added because 0 is usually reserved for padding
    output_dim=max_len,  # dimension of the dense embedding
    weights=[embedding_matrix],
    input_length=max_len,  # length of the input sequences
    trainable=False,
)

Creating the LSTM Neural Network Model with embedded GloVe weights¶

In [81]:
# Define a Sequential model
glove_lstm_model = Sequential()

# Add the Embedding layer
glove_lstm_model.add(embedding_layer)

# Add the first Bidirectional LSTM layer
glove_lstm_model.add(Bidirectional(LSTM(150, return_sequences=True)))

# Adding a dropout ratio of 0.4
rnn_model.add(Dropout(0.4))

# Add the second Bidirectional LSTM layer
glove_lstm_model.add(Bidirectional(LSTM(150)))

# Adding a dropout ratio of 0.4
rnn_model.add(Dropout(0.4))

# Add a Dense layer with ReLU activation
glove_lstm_model.add(Dense(128, activation="relu"))

# Adding a dropout ratio of 0.4
rnn_model.add(Dropout(0.4))

# Add the final Dense layer with softmax activation
glove_lstm_model.add(Dense(3, activation="softmax"))
In [82]:
# Compile the model
adam = Adam(lr = 0.0001) # Reduce the learning rate in an attempt to aid model convergence 
glove_lstm_model.compile(loss='categorical_crossentropy',optimizer=adam,metrics=['accuracy'])
In [83]:
#####
# Train the model
#####

# Setup callbacks
callbacks = [
    EarlyStopping(
        monitor="loss", mode="min", verbose=1, patience=10
    ),  # stop the training process once the model stops improving.
    ModelCheckpoint(
        filepath="model_weights2.h5", save_best_only=True, monitor="loss", mode="min"
    ),  # for saving the best model during training
]

num_epochs = 50  # Number of epochs

# Train model
history = glove_lstm_model.fit(
    X_train,
    y_train,
    epochs=num_epochs,
    validation_split=0.2,
    callbacks=callbacks,
    verbose=1,
)
Epoch 1/50
293/293 [==============================] - 52s 165ms/step - loss: 0.7362 - accuracy: 0.6966 - val_loss: 0.6365 - val_accuracy: 0.7403
Epoch 2/50
293/293 [==============================] - 48s 162ms/step - loss: 0.6208 - accuracy: 0.7506 - val_loss: 0.6164 - val_accuracy: 0.7420
Epoch 3/50
293/293 [==============================] - 49s 167ms/step - loss: 0.6020 - accuracy: 0.7579 - val_loss: 0.6126 - val_accuracy: 0.7488
Epoch 4/50
293/293 [==============================] - 48s 163ms/step - loss: 0.5887 - accuracy: 0.7634 - val_loss: 0.6061 - val_accuracy: 0.7475
Epoch 5/50
293/293 [==============================] - 50s 170ms/step - loss: 0.5790 - accuracy: 0.7682 - val_loss: 0.6065 - val_accuracy: 0.7535
Epoch 6/50
293/293 [==============================] - 49s 166ms/step - loss: 0.5713 - accuracy: 0.7715 - val_loss: 0.6021 - val_accuracy: 0.7493
Epoch 7/50
293/293 [==============================] - 49s 166ms/step - loss: 0.5650 - accuracy: 0.7735 - val_loss: 0.6005 - val_accuracy: 0.7535
Epoch 8/50
293/293 [==============================] - 47s 161ms/step - loss: 0.5563 - accuracy: 0.7759 - val_loss: 0.5987 - val_accuracy: 0.7548
Epoch 9/50
293/293 [==============================] - 50s 169ms/step - loss: 0.5467 - accuracy: 0.7813 - val_loss: 0.5933 - val_accuracy: 0.7565
Epoch 10/50
293/293 [==============================] - 49s 168ms/step - loss: 0.5366 - accuracy: 0.7829 - val_loss: 0.5886 - val_accuracy: 0.7544
Epoch 11/50
293/293 [==============================] - 50s 170ms/step - loss: 0.5319 - accuracy: 0.7845 - val_loss: 0.5953 - val_accuracy: 0.7510
Epoch 12/50
293/293 [==============================] - 47s 160ms/step - loss: 0.5193 - accuracy: 0.7894 - val_loss: 0.5817 - val_accuracy: 0.7625
Epoch 13/50
293/293 [==============================] - 48s 164ms/step - loss: 0.5090 - accuracy: 0.7934 - val_loss: 0.5838 - val_accuracy: 0.7638
Epoch 14/50
293/293 [==============================] - 45s 153ms/step - loss: 0.5004 - accuracy: 0.7946 - val_loss: 0.6114 - val_accuracy: 0.7578
Epoch 15/50
293/293 [==============================] - 47s 160ms/step - loss: 0.4937 - accuracy: 0.7975 - val_loss: 0.5829 - val_accuracy: 0.7552
Epoch 16/50
293/293 [==============================] - 53s 179ms/step - loss: 0.4814 - accuracy: 0.8028 - val_loss: 0.5775 - val_accuracy: 0.7587
Epoch 17/50
293/293 [==============================] - 45s 155ms/step - loss: 0.4789 - accuracy: 0.8047 - val_loss: 0.5778 - val_accuracy: 0.7672
Epoch 18/50
293/293 [==============================] - 45s 154ms/step - loss: 0.4697 - accuracy: 0.8117 - val_loss: 0.5802 - val_accuracy: 0.7642
Epoch 19/50
293/293 [==============================] - 44s 151ms/step - loss: 0.4615 - accuracy: 0.8102 - val_loss: 0.5924 - val_accuracy: 0.7629
Epoch 20/50
293/293 [==============================] - 44s 150ms/step - loss: 0.4516 - accuracy: 0.8147 - val_loss: 0.5922 - val_accuracy: 0.7617
Epoch 21/50
293/293 [==============================] - 44s 152ms/step - loss: 0.4456 - accuracy: 0.8170 - val_loss: 0.6074 - val_accuracy: 0.7595
Epoch 22/50
293/293 [==============================] - 47s 160ms/step - loss: 0.4367 - accuracy: 0.8223 - val_loss: 0.6018 - val_accuracy: 0.7672
Epoch 23/50
293/293 [==============================] - 45s 153ms/step - loss: 0.4263 - accuracy: 0.8266 - val_loss: 0.6099 - val_accuracy: 0.7582
Epoch 24/50
293/293 [==============================] - 45s 155ms/step - loss: 0.4195 - accuracy: 0.8338 - val_loss: 0.6085 - val_accuracy: 0.7599
Epoch 25/50
293/293 [==============================] - 46s 156ms/step - loss: 0.4141 - accuracy: 0.8331 - val_loss: 0.6143 - val_accuracy: 0.7621
Epoch 26/50
293/293 [==============================] - 46s 157ms/step - loss: 0.4030 - accuracy: 0.8350 - val_loss: 0.6099 - val_accuracy: 0.7677
Epoch 27/50
293/293 [==============================] - 46s 157ms/step - loss: 0.3906 - accuracy: 0.8423 - val_loss: 0.6358 - val_accuracy: 0.7450
Epoch 28/50
293/293 [==============================] - 46s 156ms/step - loss: 0.3834 - accuracy: 0.8477 - val_loss: 0.6565 - val_accuracy: 0.7557
Epoch 29/50
293/293 [==============================] - 44s 150ms/step - loss: 0.3801 - accuracy: 0.8462 - val_loss: 0.6323 - val_accuracy: 0.7672
Epoch 30/50
293/293 [==============================] - 44s 150ms/step - loss: 0.3687 - accuracy: 0.8533 - val_loss: 0.6281 - val_accuracy: 0.7552
Epoch 31/50
293/293 [==============================] - 46s 157ms/step - loss: 0.3681 - accuracy: 0.8529 - val_loss: 0.6490 - val_accuracy: 0.7548
Epoch 32/50
293/293 [==============================] - 44s 151ms/step - loss: 0.3467 - accuracy: 0.8630 - val_loss: 0.6791 - val_accuracy: 0.7356
Epoch 33/50
293/293 [==============================] - 44s 151ms/step - loss: 0.3442 - accuracy: 0.8646 - val_loss: 0.6570 - val_accuracy: 0.7578
Epoch 34/50
293/293 [==============================] - 45s 153ms/step - loss: 0.3305 - accuracy: 0.8716 - val_loss: 0.6940 - val_accuracy: 0.7458
Epoch 35/50
293/293 [==============================] - 45s 154ms/step - loss: 0.3251 - accuracy: 0.8726 - val_loss: 0.7112 - val_accuracy: 0.7604
Epoch 36/50
293/293 [==============================] - 44s 150ms/step - loss: 0.3170 - accuracy: 0.8729 - val_loss: 0.6941 - val_accuracy: 0.7531
Epoch 37/50
293/293 [==============================] - 47s 160ms/step - loss: 0.3022 - accuracy: 0.8808 - val_loss: 0.7355 - val_accuracy: 0.7437
Epoch 38/50
293/293 [==============================] - 45s 155ms/step - loss: 0.2985 - accuracy: 0.8844 - val_loss: 0.7325 - val_accuracy: 0.7591
Epoch 39/50
293/293 [==============================] - 46s 156ms/step - loss: 0.2909 - accuracy: 0.8850 - val_loss: 0.7277 - val_accuracy: 0.7544
Epoch 40/50
293/293 [==============================] - 47s 160ms/step - loss: 0.2792 - accuracy: 0.8898 - val_loss: 0.7757 - val_accuracy: 0.7587
Epoch 41/50
293/293 [==============================] - 44s 152ms/step - loss: 0.2682 - accuracy: 0.8938 - val_loss: 0.8074 - val_accuracy: 0.7471
Epoch 42/50
293/293 [==============================] - 46s 158ms/step - loss: 0.2571 - accuracy: 0.8996 - val_loss: 0.8324 - val_accuracy: 0.7595
Epoch 43/50
293/293 [==============================] - 45s 152ms/step - loss: 0.2669 - accuracy: 0.8959 - val_loss: 0.7999 - val_accuracy: 0.7505
Epoch 44/50
293/293 [==============================] - 45s 154ms/step - loss: 0.2455 - accuracy: 0.9057 - val_loss: 0.9056 - val_accuracy: 0.7518
Epoch 45/50
293/293 [==============================] - 45s 155ms/step - loss: 0.2353 - accuracy: 0.9118 - val_loss: 0.8951 - val_accuracy: 0.7617
Epoch 46/50
293/293 [==============================] - 50s 172ms/step - loss: 0.2247 - accuracy: 0.9147 - val_loss: 0.8693 - val_accuracy: 0.7544
Epoch 47/50
293/293 [==============================] - 45s 153ms/step - loss: 0.2231 - accuracy: 0.9134 - val_loss: 0.8986 - val_accuracy: 0.7514
Epoch 48/50
293/293 [==============================] - 46s 158ms/step - loss: 0.2140 - accuracy: 0.9181 - val_loss: 0.8952 - val_accuracy: 0.7544
Epoch 49/50
293/293 [==============================] - 46s 157ms/step - loss: 0.2080 - accuracy: 0.9198 - val_loss: 0.8959 - val_accuracy: 0.7591
Epoch 50/50
293/293 [==============================] - 45s 154ms/step - loss: 0.2016 - accuracy: 0.9239 - val_loss: 0.9788 - val_accuracy: 0.7548
In [84]:
# Evaluate the best model on the test set
loss, GloVe_LSTM_testscore = glove_lstm_model.evaluate(X_test, y_test, verbose=0)

# Evaluate the best model on the trainset set
loss, GloVe_LSTM_trainscore = glove_lstm_model.evaluate(X_train, y_train, verbose=0)

accuracy_scores_train.append(GloVe_LSTM_trainscore)
accuracy_scores_test.append(GloVe_LSTM_testscore)

print("Train Acurracy Score: ", accuracy_scores_train)
print("Test Acurracy Score: ", accuracy_scores_test)
Train Acurracy Score:  [0.9917824300528273, 0.9942427790788446, 0.9405118823051453, 0.8915518522262573]
Test Acurracy Score:  [0.7535371976266545, 0.7470400728597449, 0.734337568283081, 0.7665182948112488]
In [85]:
def plot_training_hist(history):
    """Function to plot history for accuracy and loss"""

    fig, ax = plt.subplots(1, 2, figsize=(10, 4))
    # first plot
    ax[0].plot(history.history["accuracy"])
    ax[0].plot(history.history["val_accuracy"])
    ax[0].set_title("Model Accuracy")
    ax[0].set_xlabel("epoch")
    ax[0].set_ylabel("accuracy")
    ax[0].legend(["train", "validation"], loc="best")
    # second plot
    ax[1].plot(history.history["loss"])
    ax[1].plot(history.history["val_loss"])
    ax[1].set_title("Model Loss")
    ax[1].set_xlabel("epoch")
    ax[1].set_ylabel("loss")
    ax[1].legend(["train", "validation"], loc="best")


plot_training_hist(history)

Observations

  • The accuracy score of the LSTM Classifier with GloVe is 76% on unseen data.
  • The model is overfitting. Train set is 89% accuracy and validation & test score about 76%.
  • The model is still not converging even though we reduced the learning rate. A lot more data is needed to train the model so as to learn patterns in unseen data.

Final Model Selection¶

In [86]:
# Update the first two rows of the DataFrame with accuracy scores
accuracy_df.loc[0] = accuracy_scores_train
accuracy_df.loc[1] = accuracy_scores_test

# Give the index a descriptive text
index = ["Accuracy - Train Set", "Accuracy - Test Set"]
accuracy_df.index = index

# Display the updated DataFrame
accuracy_df
Out[86]:
Random Forest with BoW Random Forest with TF-IDF LSTM with Keras Tokenizer LSTM with GloVe embedding
Accuracy - Train Set 0.991782 0.994243 0.940512 0.891552
Accuracy - Test Set 0.753537 0.747040 0.734338 0.766518

Conlusion¶

  • The Random Forest with BoW and LSTM with GloVe embedding models have the best accuracy scores on test set - 75.35% & 76.65% redpectively.
  • However we will select LSTM with GloVe embedding as our Final Model because it has the highest accuracy score of 76.65% on unseen data and also lesser overfitting.

Key Insights and Business Recommendations¶

  1. Our analysis shows that 99.73% of airline_sentiment_gold is missing. This is the Gold standard sentiment label or the groundtruth label. The Business should invest in having the Gold standard sentiment label which is a carefully curated and annotated label where human annotators have manually assigned sentiment labels (such as positive, negative, or neutral) to Tweet samples. These labels are considered reliable and accurate, and they serve as a benchmark for training and evaluating machine learning models designed to automatically classify sentiment in text data with very high accuracy on unseen data. This would have really helped both our LSTM with Keras Tokenizer & LSTM with GloVe embedding models converge much better.

  2. Customer Service Issue dominated the negativereason. The Airlines should invest in Customer Service training and Infracstructure like deploying AI & NLP in the call-center for speedy and efficient resolution of issues.

  3. Only 16% of the Tweets are positive. The airlines can encourage and reward customers for sharing positive experience as humans mostly share what they are unhappy about.

  4. Airlines should encourage retweets of posivite sentiment because Retweets (and shares) are perhaps the most important feature to track on Social Media in Sentiment Analysis because it is a measure of how impactful the sentiment is, how much users relate & agree with the sentiment and how far it will spread among users. Our analysis shows that:

    • Southwest has the most retweet of positive sentiments followed by Virgin America
    • US Airways has the most retweet of negative sentiment followed by Delta
    • Neutral sentiments do not appear to get retweets.

Further Analysis and Modeling:¶

  1. Geospatial Analysis can be done to drills down on airline sentiments by location. Location data (e.g., 'tweet_location' or 'user_timezone') can be used to create geospatial visualizations to understand where tweets are originating from, providing local intelligence.

  2. Since Southwest has the most retweet of positive sentiments followed by Virgin America. The social media strategies of these 2 organization should be understudied and analyzed.

  3. It's interesting to see "jetblue" as a more dominant word than many of the airlines represented in the airlines column...considering that jetblue is not a part of our analysis. Should Jetblue have been included in analysis? Are there relationships that needs to be unearthed? Is more data engineering and preparation needed? These are questions that needs answers. Further analysis needed.

  4. More data, preferably "gold sentiments" Ground Truth labelled data needed to improrove the accuracy and convergence of our models.